Data Analytics using Python with JupyterLab
Data analytics helps a business operate more efficiently, maximize profits, and make strategically guided decisions. It is an important stage in a Data Pipeline Studio (DPS) pipeline and uses machine learning algorithms to analyze the available data.
Data Pipeline Studio (DPS) provides the option of either using predefined algorithms, such as Random Forest Classifier and Support Vector Classifier, or creating custom algorithms according to your specific requirements.
DPS supports Python with JupyterLab for the data analytics stage with Snowflake or Amazon S3 as a data lake. You can create a data pipeline with the following combinations:
- Data Lake (Snowflake) > Data Analytics (Python with JupyterLab) > Data Lake (Snowflake)
- Data Lake (Snowflake) > Data Analytics (Python with JupyterLab) > Data Lake (Amazon S3)
- Data Lake (Amazon S3) > Data Analytics (Python with JupyterLab) > Data Lake (Amazon S3)
- Data Lake (Amazon S3) > Data Analytics (Python with JupyterLab) > Data Lake (Snowflake)
- On the home page of DPS, add the following stages and connect them as shown below:
- Data Lake: Snowflake
- Data Analytics: Python with JupyterLab
- Data Lake: Amazon S3
- Configure the Amazon S3 and Snowflake nodes. Ensure that JupyterLab has access to the S3 node in the data pipeline. See Setting up S3 access from JupyterLab.
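Once the S3 node is configured, you can verify from a JupyterLab notebook cell that the kernel can actually reach the bucket before running any algorithm. This is a minimal sketch, assuming boto3 is available in the JupyterLab environment and that AWS credentials are supplied by your setup; the bucket name and prefix in the usage example are placeholders.

```python
import importlib.util

def list_s3_keys(bucket, prefix=""):
    """List object keys in an S3 bucket to confirm JupyterLab has access.

    Raises RuntimeError with a readable hint if boto3 is missing, so the
    notebook fails fast with guidance instead of a bare ImportError.
    """
    if importlib.util.find_spec("boto3") is None:
        raise RuntimeError("boto3 is not installed; run: pip install boto3")
    import boto3  # imported lazily so the availability check above runs first

    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in response.get("Contents", [])]

# Example (hypothetical bucket name):
# print(list_s3_keys("my-dps-data-lake", prefix="analytics/"))
```

If this call fails with an access error, revisit Setting up S3 access from JupyterLab before proceeding.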
- Click the Python with JupyterLab node, and then click +Algorithm.
- Provide the following information:
  - Algorithm Type: Select either a predefined algorithm or a custom algorithm to be used for data analytics.
  - Algorithm Name: Provide a unique name for the algorithm.
  - Source: Select the source table on which the algorithm is run.
  - Algorithm Label: Select a column from the pre-populated list of columns in the source table. The selected algorithm is run on this column.
    Note: Ensure that the selected column contains binary data only.
  - Target: Select a table to which the data analytics results are pushed. You can either select an existing table or create a new table by typing the name <xyz> and then clicking Create <xyz>.
- Click Add.
After you click Add, the Creation Status displays one of the following statuses, depending on whether the data analytics job for the selected algorithm is created:
- In Progress - the job is being created.
- Success - the job is created successfully.
- Failed - the job creation failed.
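The note above requires the Algorithm Label column to contain binary data only. Before creating the job, you can sanity-check the column yourself in a notebook; this is a small stdlib-only sketch (the column values shown are hypothetical):

```python
def is_binary_column(values):
    """Return True if the column holds at most two distinct non-null values."""
    distinct = {v for v in values if v is not None}
    return len(distinct) <= 2

# A label column suitable for a binary classifier:
assert is_binary_column([0, 1, 1, 0, None, 1]) is True
# A multi-class column that would need preprocessing first:
assert is_binary_column(["low", "medium", "high"]) is False
```

Running a check like this before clicking Add avoids a Failed status caused by an unsuitable label column.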
- To add multiple algorithms, repeat steps 3-5.
- To run the algorithms, choose one of the following options:
  - Publish the pipeline, and then click Run to run all the algorithms added to the pipeline.
  - Click Run to run an individual algorithm.
  - Click Run All to run the complete list of algorithms.
The Run Status of the pipeline can be one of the following:
- Running
- Success
- Failed
- Terminated
- Aborted
Click to view the dashboard generated from the pickle file.
If an algorithm fails and you need to troubleshoot it, first click the ellipsis (…) adjacent to +Algorithm and click Copy JupyterLab Token. Then click the specific algorithm. This navigates you to the JupyterLab Notebook, where you need the token to log in. Once you have logged in, you can troubleshoot the code.
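The dashboard above is generated from a pickle file produced by the algorithm run. While troubleshooting in JupyterLab, you can inspect such a file directly; this is a minimal sketch using only the standard library (the file name and payload are hypothetical, not the actual format DPS writes):

```python
import os
import pickle
import tempfile

# Hypothetical analytics result, standing in for what an algorithm
# run might serialize before the dashboard is generated.
results = {"algorithm": "random_forest", "accuracy": 0.91}

# Write the pickle file (in the real pipeline, DPS produces this).
path = os.path.join(tempfile.gettempdir(), "analytics_results.pkl")
with open(path, "wb") as f:
    pickle.dump(results, f)

# Load and inspect the file from a notebook cell.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded["algorithm"], loaded["accuracy"])
```

Only unpickle files from sources you trust, since `pickle.load` can execute arbitrary code.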
You can check the logs for the data analytics jobs that are created in Data Pipeline Studio. For each data analytics job, a Jenkins job is triggered. You can navigate to Jenkins from the Lazsa Platform to view the logs.
- Click the data analytics node. On the side drawer, click CI/CD Pipeline.
- Click Deploy.
- On the Jenkins screen, click Console Output in the left navigation pane.
- Search for Lazsa-Core-Jobs run-ml-algorithms. You can view the logs for a specific job.
There are multiple ways to navigate to Jenkins logs from the Lazsa Platform.
If your JupyterLab algorithm shows the run status as Failed, do the following:
- Click the Run Status. This navigates you to the Jenkins logs for the specific job, where you can review the error.
- In the Lazsa Platform, on the home page of Data Pipeline Studio, click the pencil icon to switch the pipeline to edit mode.
- Click the Python with JupyterLab node.
- On the Python with JupyterLab side drawer, click Start Coding. You are navigated to the JupyterLab Notebook.
To install Papermill in the JupyterLab Notebook terminal
- Copy the JupyterLab token as described in the section above. Use the token to log in to your JupyterLab application.
- On the File menu, click New > Terminal. This launches a JupyterLab terminal.
- Run the following command in the terminal:
  pip install papermill
- Run the following command to check the Papermill version:
  papermill --version
- Once the papermill package is installed successfully, run the data pipeline.
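Papermill can also be driven from a notebook cell rather than the terminal. This is a hedged sketch that first confirms the package is importable; the notebook file names in the commented call are placeholders, and `papermill.execute_notebook` is the library's standard entry point for running a notebook programmatically.

```python
import importlib.util

def papermill_available():
    """Return True if the papermill package can be imported."""
    return importlib.util.find_spec("papermill") is not None

if papermill_available():
    import papermill as pm
    # Execute an algorithm notebook, writing an executed copy alongside it.
    # File names here are hypothetical placeholders for your own notebooks.
    # pm.execute_notebook("algorithm.ipynb", "algorithm_output.ipynb")
    print("papermill is installed")
else:
    print("papermill is missing; run: pip install papermill")
```

A check like this at the top of an algorithm notebook makes a missing dependency obvious before the pipeline run fails in Jenkins.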
What's next? Snowflake Custom Transformation Job